London Crimes: Spatial Data

Python 3
Clément Tailleur - Geo Spatial Data



• Summary

I - Libraries
II - Load Data
   A - Deal wit NaN
   B - Deal with dates
   C - Deal with coordinates
III - First Visualization
IV - Machine Learning on Geo Spatial Data
   A - KMeans for spatial visualization
   B - Work with specific crimes
V - DBSCAN - London drugs possession warning
   A - Learning part
   B - Visualization part



I - Libraries

In [1]:
from sklearn.metrics import pairwise_distances
from __future__ import unicode_literals
from sklearn.cluster import KMeans, DBSCAN
from scipy.spatial import ConvexHull
from scipy.spatial import Voronoi
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn import metrics
from PIL import Image
import urllib.request
import pandas as pd
import numpy as np
import pytz as tz
import urllib
import wget
import sys
import os

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

II - Load Data

In [2]:
#https://www.kaggle.com/sohier/london-police-records/downloads/london-outcomes.csv
df = pd.read_csv("london_crimes.csv")
df.head()
Out[2]:
Crime ID Month Reported by Falls within Longitude Latitude Location LSOA code LSOA name Outcome type
0 b4adcc899360d595450a35cbe4d7d71d295bafefef98b2... 2014-06 City of London Police City of London Police NaN NaN No location NaN NaN Suspect charged
1 64b14e3efdf9e12425e3ac19f5e72b6d19b5656523d91b... 2014-06 City of London Police City of London Police -0.088843 51.509532 On or near Parking Area E01032739 City of London 001F Investigation complete; no suspect identified
2 d9909143deda1db75d8ba35b701f31b268d9273764ad42... 2014-06 City of London Police City of London Police -0.084701 51.509320 On or near LOWER THAMES STREET E01032739 City of London 001F Investigation complete; no suspect identified
3 bd44c99de1bdc727abc7c682bf3916969a3bc673e93093... 2014-06 City of London Police City of London Police -0.079080 51.519615 On or near BISHOPSGATE E01004307 Tower Hamlets 015B Formal action is not in the public interest
4 f8e3fc7e63baa91ddd6625dd1f4f48203f565fd021d9d4... 2014-06 City of London Police City of London Police -0.104281 51.516032 On or near BEAR ALLEY E01032740 City of London 001G Investigation complete; no suspect identified

III - Clean Data

   A - Deal with NaN

In [3]:
print("Shape with NA:", df.shape)
df = df.dropna()
print("Shape without NA:",df.shape)
Shape with NA: (1947050, 10)
Shape without NA: (1915110, 10)

   B - Deal with dates

In [4]:
df["Timestamp"] = pd.to_datetime(df['Month'], format='%Y-%m', errors='coerce')
df.head()
Out[4]:
Crime ID Month Reported by Falls within Longitude Latitude Location LSOA code LSOA name Outcome type Timestamp
1 64b14e3efdf9e12425e3ac19f5e72b6d19b5656523d91b... 2014-06 City of London Police City of London Police -0.088843 51.509532 On or near Parking Area E01032739 City of London 001F Investigation complete; no suspect identified 2014-06-01
2 d9909143deda1db75d8ba35b701f31b268d9273764ad42... 2014-06 City of London Police City of London Police -0.084701 51.509320 On or near LOWER THAMES STREET E01032739 City of London 001F Investigation complete; no suspect identified 2014-06-01
3 bd44c99de1bdc727abc7c682bf3916969a3bc673e93093... 2014-06 City of London Police City of London Police -0.079080 51.519615 On or near BISHOPSGATE E01004307 Tower Hamlets 015B Formal action is not in the public interest 2014-06-01
4 f8e3fc7e63baa91ddd6625dd1f4f48203f565fd021d9d4... 2014-06 City of London Police City of London Police -0.104281 51.516032 On or near BEAR ALLEY E01032740 City of London 001G Investigation complete; no suspect identified 2014-06-01
5 0bca94e0f7c12d09f829c03317f6d6dd971f159cd6fe1f... 2014-06 City of London Police City of London Police -0.077777 51.518046 On or near SANDY'S ROW E01032739 City of London 001F Investigation complete; no suspect identified 2014-06-01
In [5]:
start = datetime(2015, 1, 1)
end = datetime(2017, 12, 31)
df_cleaned = df[(df.Timestamp > start) & (df.Timestamp < end)].copy()
df_cleaned.shape
Out[5]:
(1535394, 11)
In [6]:
df_cleaned = df_cleaned[['Outcome type', 'Latitude', 'Longitude', 'Timestamp']].copy()
df_cleaned.columns = ['Outcome type', 'Latitude', 'Longitude', 'Timestamp']
df_cleaned.head()
Out[6]:
Outcome type Latitude Longitude Timestamp
3319 Offender given a drugs possession warning 51.523156 -0.097210 2015-02-01
3320 Offender given a caution 51.514763 -0.075508 2015-02-01
3321 Suspect charged 51.511923 -0.086624 2015-02-01
3322 Investigation complete; no suspect identified 51.516207 -0.101794 2015-02-01
3323 Investigation complete; no suspect identified 51.513171 -0.077694 2015-02-01

   C - Deal with coordinates

In [7]:
df_cleaned = df_cleaned[(df_cleaned.Longitude > -0.2)
                        & (df_cleaned.Longitude < -0.02)
                        & (df_cleaned.Latitude > 51.44)
                        & (df_cleaned.Latitude < 51.58)]
df_cleaned.shape
Out[7]:
(545994, 4)

III - First Visualization

In [8]:
def get_map(x, y, z, size, filename) :
    # Load map on staticmap.openstreetmap.de depending on coordinates
    static_map = "http://staticmap.openstreetmap.de/staticmap.php?center={0},{1}&zoom={2}&size={3}x{3}&maptype=mapnik".format(y,x,z,size)
    static_map_filename, headers = urllib.request.urlretrieve(static_map, filename)
    return static_map_filename

def geomap(data, center_coordinates, zoom=13, point_size=3, point_color='r', point_alpha=1):
    z, picsize = zoom, 1000
    wx = 1.0*360*(picsize/256)/(2**z) 
    wy = 0.76*360*(picsize/256)/(2**z)
    lat_center = center_coordinates[0]
    lon_center = center_coordinates[1]
    
    x_min, x_max = lon_center - wx/2, lon_center + wx/2
    y_min, y_max = lat_center - wy/2, lat_center + wy/2
    
    static_map_filename = 'london_staticmap_{}_{}.png'.format(z, picsize)
    if os.path.isfile(static_map_filename) == False:
        get_map(x, y, z, picsize, static_map_filename)

    img = Image.open(static_map_filename)

    #add the static map
    plt.imshow(img, zorder=0, extent=[x_min, x_max, y_min, y_max], interpolation='none', aspect='auto')

    #add the scatter plot of events
    plt.plot(data['Longitude'], data['Latitude'], '.',
             markerfacecolor=point_color, markeredgecolor='k',
             markersize=point_size, alpha=point_alpha)

    #limit the plot to the given box
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
##### END FUNCTION #####

plt.style.use('seaborn-white')
fig = plt.figure()
fig.set_size_inches(20,20)
london_center = (51.510579, -0.109852)
geomap(df_cleaned[['Longitude','Latitude']], center_coordinates=london_center)
plt.show()

IV - Machine Learning on Geo Spatial Data

   A - KMeans for spatial visualization

In [9]:
ml = KMeans(n_clusters=100, init='k-means++')
ml.fit(df_cleaned[['Longitude', 'Latitude']].sample(frac=0.3))
Out[9]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=100, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
In [10]:
cluster = ml.cluster_centers_
cluster[:10]
Out[10]:
array([[-5.90446484e-02,  5.15407119e+01],
       [-1.26035826e-01,  5.15092029e+01],
       [-1.96247401e-01,  5.15112208e+01],
       [-9.86539340e-02,  5.14912952e+01],
       [-1.22569942e-01,  5.15721672e+01],
       [-2.64236348e-02,  5.14747992e+01],
       [-1.71171957e-01,  5.14713170e+01],
       [-1.14828535e-01,  5.14619858e+01],
       [-1.55714797e-01,  5.15215028e+01],
       [-1.05851935e-01,  5.15640038e+01]])
In [11]:
df_cleaned['cluster'] = ml.predict(df_cleaned[['Longitude','Latitude']])
df_cleaned[['Outcome type','Latitude','Longitude', 'cluster']].head()
Out[11]:
Outcome type Latitude Longitude cluster
3319 Offender given a drugs possession warning 51.523156 -0.097210 79
3320 Offender given a caution 51.514763 -0.075508 11
3321 Suspect charged 51.511923 -0.086624 83
3322 Investigation complete; no suspect identified 51.516207 -0.101794 36
3323 Investigation complete; no suspect identified 51.513171 -0.077694 11
In [12]:
def voronoi_polygons_2d(vor, radius=None):
    """
    Reconstruct infinite voronoi regions in a 2D diagram to finite
    regions.

    Input_args:
    vor : Voronoi
        Input diagram
    radius : float, optional
        Distance to 'points at infinity'.

    :returns:
    regions : list of tuples
        Indices of vertices in each revised Voronoi regions.
    vertices : list of tuples
        Coordinates for revised Voronoi vertices. Same as coordinates
        of input vertices, with 'points at infinity' appended to the
        end.

    """
    if vor.points.shape[1] != 2:
        raise ValueError("Requires 2D input")

    new_regions = []
    new_vertices = vor.vertices.tolist()

    center = vor.points.mean(axis=0)
    if radius is None:
        radius = vor.points.ptp().max()*2

    # Construct a map containing all ridges for a given point
    all_ridges = {}
    for (p1, p2), (v1, v2) in zip(vor.ridge_points, vor.ridge_vertices):
        all_ridges.setdefault(p1, []).append((p2, v1, v2))
        all_ridges.setdefault(p2, []).append((p1, v1, v2))

    # Reconstruct infinite regions
    for p1, region in enumerate(vor.point_region):
        vertices = vor.regions[region]

        if all([v >= 0 for v in vertices]):
            # finite region
            new_regions.append(vertices)
            continue

        # reconstruct a non-finite region
        ridges = all_ridges[p1]
        new_region = [v for v in vertices if v >= 0]

        for p2, v1, v2 in ridges:
            if v2 < 0:
                v1, v2 = v2, v1
            if v1 >= 0:
                # finite ridge: already in the region
                continue

            # Compute the missing endpoint of an infinite ridge

            t = vor.points[p2] - vor.points[p1] # tangent
            t /= np.linalg.norm(t)
            n = np.array([-t[1], t[0]])  # normal

            midpoint = vor.points[[p1, p2]].mean(axis=0)
            direction = np.sign(np.dot(midpoint - center, n)) * n
            far_point = vor.vertices[v2] + direction * radius

            new_region.append(len(new_vertices))
            new_vertices.append(far_point.tolist())

        # sort region counterclockwise
        vs = np.asarray([new_vertices[v] for v in new_region])
        c = vs.mean(axis=0)
        angles = np.arctan2(vs[:,1] - c[1], vs[:,0] - c[0])
        new_region = np.array(new_region)[np.argsort(angles)]

        # finish
        new_regions.append(new_region.tolist())

    return new_regions, np.asarray(new_vertices)

points = cluster
# compute Voronoi tesselation
vor = Voronoi(points)
regions, vertices = voronoi_polygons_2d(vor)
In [13]:
## PLOT
# prepare figure
plt.style.use('seaborn-white')
fig = plt.figure()
fig.set_size_inches(20,20)

#geomap
geomap(df_cleaned[['Longitude','Latitude']], london_center, 13, 2,'k', 0.1)

# centroids
plt.plot(points[:,0], points[:,1], 'wo',markersize=10)

# colorize
for region in regions:
    polygon = vertices[region]
    plt.fill(*zip(*polygon), alpha=0.4)

plt.show()

   B - Work with specific crimes

In [14]:
df_cleaned['Outcome type'].value_counts()
Out[14]:
Investigation complete; no suspect identified          395659
Suspect charged                                         63187
Offender given a caution                                16002
Offender given a drugs possession warning               13888
Offender sent to prison                                 10658
Defendant found not guilty                               9229
Offender given community sentence                        7029
Local resolution                                         6383
Offender fined                                           5082
Offender given suspended prison sentence                 4695
Offender given penalty notice                            4465
Offender given conditional discharge                     3037
Court case unable to proceed                             2944
Unable to prosecute suspect                              1498
Offender otherwise dealt with                             766
Offender ordered to pay compensation                      549
Offender deprived of property                             390
Suspect charged as part of another case                   187
Formal action is not in the public interest               151
Offender given absolute discharge                          98
Defendant sent to Crown Court                              82
Further investigation is not in the public interest        12
Action to be taken by another organisation                  3
Name: Outcome type, dtype: int64
In [15]:
plt.style.use('seaborn-white')
fig = plt.figure()
fig.set_size_inches(19, 6)

crime_1 = 'Suspect charged'
plt.subplot(1, 3, 1)
geomap(df_cleaned[df_cleaned['Outcome type'] == crime_1][['Longitude', 'Latitude']], london_center)
plt.title(str(crime_1))

crime_2 = 'Offender given a drugs possession warning'
plt.subplot(1, 3, 2)
geomap(df_cleaned[df_cleaned['Outcome type'] == crime_2][['Longitude', 'Latitude']], london_center)
plt.title(str(crime_2))

crime_3 = 'Offender given suspended prison sentence'
plt.subplot(1, 3, 3)
geomap(df_cleaned[df_cleaned['Outcome type'] == crime_3][['Longitude', 'Latitude']], london_center)
plt.title(str(crime_3))
plt.show()

V - DBSCAN - London drugs possession warning

   A - Learning part

In [16]:
crime = 'Offender given a drugs possession warning'

# eps - The maximum distance between two samples for them to be considered as in the same neighborhood.
# At least 10 points in a cluster
data = df_cleaned[df_cleaned['Outcome type'] == crime][['Longitude','Latitude']]
db = DBSCAN(eps=0.0023, min_samples=10, n_jobs=-1).fit(np.array(data[['Longitude','Latitude']])) 
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
data['dbscan_cluster'] =  db.labels_
data['dbscan_core']    = core_samples_mask
print("Number of clusters: {}".format(len(set(data['dbscan_cluster']))))
data.head()
Number of clusters: 95
Out[16]:
Longitude Latitude dbscan_cluster dbscan_core
3319 -0.097210 51.523156 0 True
3365 -0.092719 51.512652 -1 False
3405 -0.101055 51.513552 4 False
3411 -0.082571 51.514753 1 True
3413 -0.104494 51.509848 2 True

   B - Visualization part

In [17]:
plt.style.use('seaborn-white')
fig = plt.figure()
fig.set_size_inches(20, 20)

empty = pd.DataFrame(columns=['Longitude', 'Latitude'])
geomap(empty, london_center, 13, 2, 'k', 0.1)

unique_labels = sorted(set(data['dbscan_cluster']))
for k in unique_labels:
    xy = data[data['dbscan_cluster'] == k]
    plt.plot(xy['Longitude'], xy['Latitude'], 'kD' if k < 0 else 'o', markersize=7)
plt.show()
In [18]:
plt.style.use('seaborn-white')
fig = plt.figure()
fig.set_size_inches(20,20)

empty= pd.DataFrame(columns=['Longitude','Latitude'])
geomap(empty, london_center, 13, 2,'k',0.1)

# convex hulls for evry cluster 
for k in unique_labels:
    if k>=0:
        xy = data[data['dbscan_cluster'] == k][['Longitude','Latitude']].reset_index(drop=True)
        try:
            hull = ConvexHull(xy.as_matrix())
            for simplex in hull.simplices:
                plt.plot(xy.iloc[simplex]['Longitude'], xy.iloc[simplex]['Latitude'], 'r-', lw=5)
        except:
            pass

plt.show()

MERCI !! THANK YOU !!

Clément Tailleur